Abstract

We study the problem of learning Markov decision processes with finite state and action spaces when the transition probability distributions and loss functions are chosen adversarially and are allowed to change with time. We introduce an algorithm whose regret with respect to any policy in a comparison class grows as the square root of the number of rounds of the game, provided the transition probabilities satisfy a uniform mixing condition. Our approach is efficient as long as the comparison class is polynomial and we can compute expectations over sample paths for each policy. Designing an efficient algorithm with small regret for the general case remains an open problem.

1 Notation

Let $\mathcal{X}$ be a finite state space and $\mathcal{A}$ be a finite action space. Let $\Delta_S$ be the space of probability distributions over a set $S$. Define a policy $\pi$ as a mapping from the state space to $\Delta_{\mathcal{A}}$, $\pi : \mathcal{X} \to \Delta_{\mathcal{A}}$. We use $\pi(a|x)$ to denote the probability of choosing action $a$ in state $x$ under policy $\pi$. A random action under policy $\pi$ is denoted by $\pi(x)$. A transition probability kernel (or transition model) $m$ is a mapping from the direct product of the state and action spaces to $\Delta_{\mathcal{X}}$, $m : \mathcal{X} \times \mathcal{A} \to \Delta_{\mathcal{X}}$. Let $P(\pi, m)$ be the transition probability matrix of policy $\pi$ under transition model $m$. A loss function is a bounded real-valued function over state and action spaces, $\ell : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$. For a vector $v$, define $\|v\|_1 = \sum_i |v_i|$. For a real-valued function $f$ defined over $\mathcal{X} \times \mathcal{A}$, define $\|f\|_{\infty,1} = \max_{x \in \mathcal{X}} \sum_{a \in \mathcal{A}} |f(x,a)|$. The inner product between two vectors $v$ and $w$ is denoted by $\langle v, w \rangle$.

2 Introduction 
Consider the following game between a learner and an adversary: at round $t$, the learner chooses a policy $\pi_t$ from a policy class $\Pi$. In response, the adversary chooses a transition model $m_t$ from a set of models $M$ and a loss function $\ell_t$. The learner takes action $a_t \sim \pi_t(\cdot|x_t)$, moves to state $x_{t+1} \sim m_t(\cdot|x_t, a_t)$ and suffers loss $\ell_t(x_t, a_t)$. To simplify the discussion, we assume that the adversary is oblivious, i.e., its choices do not depend on the previous choices of the learner. We assume that $\ell_t \in [0,1]$. In this paper, we study the full-information version of the game, where the learner observes the transition model $m_t$ and the loss function $\ell_t$ at the end of round $t$. The game is shown in Figure 1. The objective of the learner is to suffer low loss over a period of $T$ rounds, while the performance of the learner is measured using its regret with respect to the total loss he would have achieved had he followed the stationary policy in the comparison class $\Pi$ minimizing the total loss.

Even-Dar et al. (2004) prove a hardness result for MDP problems with adversarially chosen transition models. Their proof, however, seems to have gaps as it assumes that the learner chooses a deterministic policy before observing the state at each round. Note that an online learning algorithm only needs to choose an action at the current state and does not need to construct a complete deterministic policy at each round. Their hardness result applies to deterministic transition models, while we make a mixing assumption in our analysis. Thus, it is still an open problem whether it is possible to obtain a computationally efficient algorithm with a sublinear regret.

Yu and Mannor (2009a,b) study the same setting, but obtain only a regret bound that scales with the amount of variation in the transition models. This regret bound can grow linearly with time.



Initial state: $x_0$
for t := 1, 2, ... do
    Learner chooses policy $\pi_t$
    Adversary chooses model $m_t$ and loss function $\ell_t$
    Learner takes action $a_t \sim \pi_t(\cdot|x_t)$
    Learner suffers loss $\ell_t(x_t, a_t)$
    Update state $x_{t+1} \sim m_t(\cdot|x_t, a_t)$
    Learner observes $m_t$ and $\ell_t$
end for



Figure 1: Online Markov Decision Processes 
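
The protocol of Figure 1 can be simulated directly for an oblivious adversary, since all models and losses can then be fixed in advance. The following is a minimal sketch under our own assumptions, not part of the original paper: the function name `run_online_mdp` and the array layout (`pi[x, a]`, `m[x, a, x']`, `l[x, a]`) are hypothetical choices made for illustration.

```python
import numpy as np

def run_online_mdp(policies, models, losses, x0, rng):
    """Simulate the protocol of Figure 1 against a fixed (oblivious) adversary.

    Assumed layout (illustrative, not from the paper):
      policies[t][x, a] : probability that the round-t policy picks a in x
      models[t][x, a, y]: probability of moving from x to y under action a
      losses[t][x, a]   : loss in [0, 1]
    Returns the total loss suffered along one sample path.
    """
    x, total = x0, 0.0
    for pi_t, m_t, l_t in zip(policies, models, losses):
        a = rng.choice(pi_t.shape[1], p=pi_t[x])    # a_t ~ pi_t(.|x_t)
        total += l_t[x, a]                          # suffer l_t(x_t, a_t)
        x = rng.choice(m_t.shape[2], p=m_t[x, a])   # x_{t+1} ~ m_t(.|x_t, a_t)
    return total
```

The learner's policy sequence is passed in as data here; an actual learner would choose each policy from the models and losses observed in earlier rounds.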



Even-Dar et al. (2009) prove regret bounds for MDP problems with a fixed and known transition model and adversarially chosen loss functions. In this paper, we prove regret bounds for MDP problems with adversarially chosen transition models and loss functions. We are not aware of any earlier regret bound for this setting. Our approach is efficient as long as the comparison class is polynomial and we can compute expectations over sample paths for each policy.

MDPs with changing transition kernels are good models for a wide range of problems, including 
dialogue systems, clinical trials, portfolio optimization, two player games such as poker, etc. 

3 Online MDP Problems 

Let $A$ be an online learning algorithm that generates a policy $\pi_t$ at round $t$. Let $x_t^A$ be the state at round $t$ if we have followed the policies generated by algorithm $A$. Similarly, $x_t^\pi$ denotes the state if we have chosen the same policy $\pi$ up to time $t$. Let $\ell(x, \pi) = \ell(x, \pi(x))$. The regret of algorithm $A$ up to round $T$ with respect to any policy $\pi \in \Pi$ is defined by

\[
R_T(A, \pi) = \sum_{t=1}^{T} \ell_t(x_t^A, a_t) - \sum_{t=1}^{T} \ell_t(x_t^\pi, \pi) ,
\]

where $a_t = \pi_t(x_t^A)$. Note that the regret with respect to $\pi$ is defined in terms of the sequence of states $x_t^\pi$ that would have been visited under policy $\pi$. Our objective is to design an algorithm that achieves low regret with respect to any policy $\pi$.

In the absence of state variables, the problem reduces to a full information online learning problem (Cesa-Bianchi and Lugosi, 2006). The difficulty with MDP problems is that, unlike the full information online learning problems, the choice of policy at each round changes the future states and losses. The main idea behind the design and the analysis of our algorithm is the following regret decomposition:

\[
R_T(A, \pi) = \sum_{t=1}^{T} \ell_t(x_t^A, a_t) - \sum_{t=1}^{T} \ell_t(x_t^{\pi_t}, \pi_t) + \sum_{t=1}^{T} \ell_t(x_t^{\pi_t}, \pi_t) - \sum_{t=1}^{T} \ell_t(x_t^\pi, \pi) . \tag{1}
\]

Let

\[
B_T(A) = \sum_{t=1}^{T} \ell_t(x_t^A, a_t) - \sum_{t=1}^{T} \ell_t(x_t^{\pi_t}, \pi_t) ,
\]

\[
C_T(A, \pi) = \sum_{t=1}^{T} \ell_t(x_t^{\pi_t}, \pi_t) - \sum_{t=1}^{T} \ell_t(x_t^\pi, \pi) .
\]



Notice that the choice of policies has no influence over future losses in $C_T(A, \pi)$. Thus, $C_T(A, \pi)$ can be bounded by a specific reduction to full information online learning algorithms (to be specified later). Also, notice that the competitor policy $\pi$ does not appear in $B_T(A)$. In fact, $B_T(A)$ depends only on the algorithm $A$. We will show that if algorithm $A$ and the class of models satisfy the following two "smoothness" assumptions, then $B_T(A)$ can be bounded by a sublinear term.

Assumption A1 (Rarely Changing Policies) Let $\alpha_t$ be the probability that algorithm $A$ changes its policy at round $t$. There exists a constant $D$ such that for any $1 \le t \le T$, any sequence of models $m_1, \ldots, m_t$ and loss functions $\ell_1, \ldots, \ell_t$, $\alpha_t \le D/\sqrt{t}$.

N: number of experts, T: number of rounds.
Initialize $w_{i,0} = 1$ for each expert $i$.
$W_0 = N$.
for t := 1, 2, ... do
    For any $i$, $p_{i,t} = w_{i,t-1}/W_{t-1}$.
    Draw $I_t$ such that for any $i$, $\mathbb{P}(I_t = i) = p_{i,t}$.
    Choose the action suggested by expert $I_t$.
    The adversary chooses loss function $c_t$.
    The learner suffers loss $c_t(I_t)$.
    For expert $i$, $w_{i,t} = w_{i,t-1} e^{-\eta c_t(i)}$.
    $W_t = \sum_{i=1}^{N} w_{i,t}$.
end for



Figure 2: The EWA Algorithm 
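
As an illustration of Figure 2, the sampling distributions of EWA can be computed from a table of losses. This is a minimal sketch, not the authors' code: the function name `ewa_distributions` is hypothetical, and the learning rate `eta` is treated as a free positive parameter because the figure does not fix its value.

```python
import numpy as np

def ewa_distributions(loss_matrix, eta):
    """Return p[t, i] = w_{i,t-1} / W_{t-1} for the EWA weights of Figure 2.

    loss_matrix[t, i] is the loss c_t(i) of expert i at round t (row 0 is the
    first round); eta > 0 is an assumed learning rate.
    """
    T, N = loss_matrix.shape
    w = np.ones(N)                              # w_{i,0} = 1
    p = np.empty((T, N))
    for t in range(T):
        p[t] = w / w.sum()                      # distribution used at this round
        w = w * np.exp(-eta * loss_matrix[t])   # exponential weight update
    return p
```

Because a fresh expert is drawn from `p[t]` in every round, the selected expert typically changes from round to round even when the distribution itself moves slowly.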



N: number of experts, T: number of rounds.
$\eta = \min\{\sqrt{\log N / T}, 1/2\}$.
Initialize $w_{i,0} = 1$ for each expert $i$.
$W_0 = N$.
for t := 1, 2, ... do
    For any $i$, $p_{i,t} = w_{i,t-1}/W_{t-1}$.
    With probability $\beta_t = w_{I_{t-1},t-1}/w_{I_{t-1},t-2}$ choose the previously selected
    expert, $I_t = I_{t-1}$, and with probability $1 - \beta_t$, choose $I_t$ based on the
    distribution $q_t = (p_{1,t}, \ldots, p_{N,t})$.
    Learner takes the action suggested by expert $I_t$.
    The adversary chooses loss function $c_t$.
    The learner suffers loss $c_t(I_t)$.
    For all experts $i$, $w_{i,t} = w_{i,t-1}(1 - \eta)^{c_t(i)}$.
    $W_t = \sum_{i=1}^{N} w_{i,t}$.
end for



Figure 3: The Shrinking Dartboard Algorithm 
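
The Shrinking Dartboard rule of Figure 3 can be sketched as follows. This is an illustrative implementation under our own assumptions: the name `shrinking_dartboard` is hypothetical, and because the figure leaves the first round unspecified (there is no previous expert), the sketch draws the first expert from the uniform initial distribution.

```python
import numpy as np

def shrinking_dartboard(loss_matrix, rng):
    """Run SD on a T x N loss matrix with entries in [0, 1].

    Returns the chosen experts, the per-round ratios (W_{t-1} - W_t) / W_{t-1},
    and the step size eta.
    """
    T, N = loss_matrix.shape
    eta = min(np.sqrt(np.log(N) / T), 0.5)
    w = np.ones(N)            # current weights
    w_prev = w.copy()         # weights one round earlier
    current = None
    chosen, ratios = [], []
    for t in range(T):
        p = w / w.sum()
        if current is None:
            current = rng.choice(N, p=p)      # assumed handling of round 1
        elif rng.random() >= w[current] / w_prev[current]:
            current = rng.choice(N, p=p)      # redraw with probability 1 - beta_t
        chosen.append(int(current))
        w_prev = w
        w = w * (1.0 - eta) ** loss_matrix[t]  # multiplicative update
        ratios.append((w_prev.sum() - w.sum()) / w_prev.sum())
    return np.array(chosen), np.array(ratios), eta
```

Since losses lie in $[0,1]$, every weight shrinks by a factor of at least $1 - \eta$, so each returned ratio is at most `eta`; this is the quantity that controls how often the expert is redrawn.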



Assumption A2 (Uniform Mixing) There exists a constant $\tau > 0$ such that for all distributions $d$ and $d'$ over the state space, any deterministic policy $\pi$, and any model $m \in M$,

\[
\|dP(\pi, m) - d'P(\pi, m)\|_1 \le e^{-1/\tau} \|d - d'\|_1 .
\]

As discussed by Neu et al. (2010), if Assumption A2 holds for deterministic policies, then it holds for all policies.
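
For a single transition matrix $P(\pi, m)$, the best constant in the inequality of Assumption A2 can be computed explicitly: by a standard fact about stochastic matrices (Dobrushin's ergodic coefficient), it equals half the largest $\ell_1$ distance between two rows. The sketch below uses this fact; the function names are hypothetical, and checking the assumption in full would require repeating the computation for every deterministic policy and every model in $M$.

```python
import numpy as np

def contraction_coefficient(P):
    """Smallest c with ||dP - d'P||_1 <= c ||d - d'||_1 for all distributions d, d'.

    For a row-stochastic matrix P this is 0.5 * max_{i,j} ||P[i] - P[j]||_1.
    """
    P = np.asarray(P, dtype=float)
    row_dist = np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)
    return 0.5 * row_dist.max()

def mixing_constant(P):
    """Smallest tau with exp(-1/tau) >= contraction coefficient of P.

    Returns inf when P does not contract at all, and 0.0 when every tau > 0 works.
    """
    c = contraction_coefficient(P)
    if c >= 1.0:
        return np.inf
    if c == 0.0:
        return 0.0
    return -1.0 / np.log(c)
```

For example, the two-state matrix with rows $(0.9, 0.1)$ and $(0.2, 0.8)$ has contraction coefficient $0.7$, whereas the identity matrix does not mix and yields an infinite constant.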

3.1 Full Information Algorithms 

We would like to have a full information online learning algorithm that rarely changes its policy. The first candidate that we consider is the well-known Exponentially Weighted Average (EWA) algorithm (Vovk, 1990; Littlestone and Warmuth, 1994) shown in Figure 2. In our MDP problem, the EWA algorithm chooses a policy $\pi \in \Pi$ according to distribution

\[
q_t(\pi) \propto \exp\left( -\lambda \sum_{s=1}^{t-1} \mathbb{E}\left[ \ell_s(x_s^\pi, \pi) \right] \right), \quad \lambda > 0 . \tag{2}
\]

The policies that this EWA algorithm generates are most likely different in consecutive rounds, and thus the EWA algorithm might change its policy frequently. However, a variant of EWA, called Shrinking Dartboard (SD) (Geulen et al., 2010) and shown in Figure 3, satisfies Assumption A1. Our algorithm, called SD-MDP, is based on the SD algorithm and is shown in Figure 4. Notice that the algorithm needs to know the number of rounds, $T$, in advance.

T: number of rounds.
$\eta = \min\{\sqrt{\log|\Pi| / T}, 1/2\}$.
For all policies $\pi \in \{1, \ldots, |\Pi|\}$, $w_{\pi,0} = 1$.
for t := 1, 2, ... do
    For any $\pi$, $p_{\pi,t} = w_{\pi,t-1}/W_{t-1}$.
    With probability $\beta_t = w_{\pi_{t-1},t-1}/w_{\pi_{t-1},t-2}$ choose the previous policy, $\pi_t = \pi_{t-1}$,
    while with probability $1 - \beta_t$, choose $\pi_t$ based on the distribution
    $q_t = (p_{1,t}, \ldots, p_{|\Pi|,t})$.
    Learner takes the action $a_t \sim \pi_t(\cdot|x_t)$.
    Adversary chooses transition model $m_t$ and loss function $\ell_t$.
    Learner suffers loss $\ell_t(x_t, a_t)$.
    Learner observes $m_t$ and $\ell_t$.
    Update state: $x_{t+1} \sim m_t(\cdot|x_t, a_t)$.
    For all policies $\pi$, $w_{\pi,t} = w_{\pi,t-1}(1 - \eta)^{\mathbb{E}[\ell_t(x_t^\pi, \pi)]}$.
    $W_t = \sum_{\pi \in \Pi} w_{\pi,t}$.
end for



Figure 4: SD-MDP: The Shrinking Dartboard Algorithm for Markov Decision Processes 
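
The weight update in Figure 4 needs the expected loss $\mathbb{E}[\ell_t(x_t^\pi, \pi)]$ of every policy in the class. One way to obtain it, sketched below under our own assumptions, is to propagate the state distribution of each policy through the observed models. The function names and the array layout (`pi[x, a]`, `m[x, a, x']`, `l[x, a]`) are hypothetical, and the sketch processes all rounds in a batch, whereas the algorithm itself would perform one propagation step per round.

```python
import numpy as np

def policy_transition_matrix(pi, m):
    """P(pi, m)[x, y] = sum_a pi(a|x) m(y|x, a)."""
    return np.einsum("xa,xay->xy", pi, m)

def expected_losses(policies, models, losses, d0):
    """c[t, i] = E[l_t(x_t^{pi_i}, pi_i)] when policy i is followed from the start.

    d0 is the initial state distribution shared by all policies.
    """
    T, K = len(losses), len(policies)
    c = np.empty((T, K))
    for i, pi in enumerate(policies):
        d = np.asarray(d0, dtype=float)
        for t in range(T):
            c[t, i] = d @ (pi * losses[t]).sum(axis=1)         # expected loss at round t
            d = d @ policy_transition_matrix(pi, models[t])    # next state distribution
    return c

def sd_mdp_weights(c):
    """Final weights w_{pi,T} = (1 - eta)^{sum_t c[t, pi]} with eta as in Figure 4."""
    T, K = c.shape
    eta = min(np.sqrt(np.log(K) / T), 0.5)
    return (1.0 - eta) ** c.sum(axis=0)
```

The cost of this computation grows linearly with the number of policies, which reflects the statement that the approach is efficient when the comparison class is polynomial.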



Consider a basic full information problem with $N$ experts. Let $R_T(\mathrm{SD}, i)$ be the regret of the SD algorithm with respect to expert $i$ up to time $T$. We have the following results for the SD algorithm.

Theorem 1. For any expert $i \in \{1, \ldots, N\}$,

\[
R_T(\mathrm{SD}, i) \le 4\sqrt{T \log N} + \log N ,
\]

and also for any $1 \le t \le T$,

\[
\mathbb{P}(\text{Switch at time } t) \le \sqrt{\frac{\log N}{T}} .
\]

Proof. The proof of the regret bound can be found in (Geulen et al., 2010, Theorem 3). The proof of the bound on the probability of switch is similar to the proof of Lemma 2 in (Geulen et al., 2010) and is as follows: As shown in (Geulen et al., 2010, Lemma 2), the probability of switch at time $t$ is

\[
\alpha_t = \frac{W_{t-1} - W_t}{W_{t-1}} .
\]

Thus, $W_t = (1 - \alpha_t) W_{t-1}$. Because the loss function is bounded in $[0, 1]$, we have that

\[
W_t = \sum_{i=1}^{N} w_{i,t} = \sum_{i=1}^{N} w_{i,t-1} (1 - \eta)^{c_t(i)} \ge \sum_{i=1}^{N} w_{i,t-1} (1 - \eta) = (1 - \eta) W_{t-1} .
\]

Thus, $1 - \alpha_t \ge 1 - \eta$, and thus,

\[
\alpha_t \le \eta \le \sqrt{\frac{\log N}{T}} .
\]

□



3.2 Analysis of the SD-MDP Algorithm 

The main result of this section is the following regret bound for the SD-MDP algorithm. 

Theorem 2. Let the loss functions selected by the adversary be bounded in $[0, 1]$, and the transition models selected by the adversary satisfy Assumption A2. Then, for any policy $\pi \in \Pi$,

\[
\mathbb{E}\left[ R_T(\text{SD-MDP}, \pi) \right] \le (4 + 2\tau^2) \sqrt{T \log|\Pi|} + \log|\Pi| .
\]

In the rest of this section, we write $A$ to denote the SD-MDP algorithm. For the proof we use the regret decomposition (1):

\[
R_T(A, \pi) = B_T(A) + C_T(A, \pi) .
\]

3.2.1 Bounding $\mathbb{E}[C_T(A, \pi)]$

Lemma 3. For any policy $\pi \in \Pi$,

\[
\mathbb{E}\left[ C_T(A, \pi) \right] = \mathbb{E}\left[ \sum_{t=1}^{T} \ell_t(x_t^{\pi_t}, \pi_t) - \sum_{t=1}^{T} \ell_t(x_t^\pi, \pi) \right] \le 4\sqrt{T \log|\Pi|} + \log|\Pi| .
\]

Proof. Consider the following imaginary game between a learner and an adversary: we have a set of experts (policies) $\Pi = \{\pi^1, \ldots, \pi^{|\Pi|}\}$. At round $t$, the adversary chooses a loss vector $c_t \in [0, 1]^{|\Pi|}$, whose $i$th element determines the loss of expert $\pi^i$ at this round. The learner chooses a distribution over experts $q_t$ (defined by the SD algorithm), from which it draws an expert $\pi_t$. Next, the learner observes the loss function $c_t$. From the regret bound for the SD algorithm (Theorem 1), it is guaranteed that for any expert $\pi$,

\[
\sum_{t=1}^{T} \langle c_t, q_t \rangle - \sum_{t=1}^{T} c_t(\pi) \le 4\sqrt{T \log|\Pi|} + \log|\Pi| .
\]

Next, we determine how the adversary chooses the loss vector. At time $t$, the adversary chooses a loss function $\ell_t$ and sets $c_t(\pi^i) = \mathbb{E}\left[ \ell_t(x_t^{\pi^i}, \pi^i) \right]$. Noting that $\langle c_t, q_t \rangle = \mathbb{E}\left[ \ell_t(x_t^{\pi_t}, \pi_t) \right]$ and $c_t(\pi) = \mathbb{E}\left[ \ell_t(x_t^\pi, \pi) \right]$ finishes the proof.

3.2.2 Bounding $\mathbb{E}[B_T(A)]$

First, we prove the following two lemmas.

Lemma 4. For any state distribution $d$, any transition model $m$, and any policies $\pi$ and $\pi'$,

\[
\|dP(\pi, m) - dP(\pi', m)\|_1 \le \|\pi - \pi'\|_{\infty,1} .
\]

Proof. The proof can be found in (Even-Dar et al., 2009), Lemma 5.1.

□

Lemma 5. Let $\alpha_t$ be the probability of a policy switch at time $t$. Then, $\alpha_t \le \sqrt{\log|\Pi| / T}$.

Proof. The proof is identical to the proof of Theorem 1.

□

Lemma 6. We have that

\[
\mathbb{E}\left[ B_T(A) \right] = \mathbb{E}\left[ \sum_{t=1}^{T} \ell_t(x_t^A, a_t) - \sum_{t=1}^{T} \ell_t(x_t^{\pi_t}, \pi_t) \right] \le 2\tau^2 \sqrt{T \log|\Pi|} .
\]



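Proof. Let $\mathcal{F}_t = \sigma(\pi_1, \ldots, \pi_t)$. Notice that the choice of policies is independent of the state variables. We can write

\begin{align*}
\mathbb{E}\left[ B_T(A) \right]
&= \mathbb{E}\left[ \sum_{t=1}^{T} \ell_t(x_t^A, a_t) - \sum_{t=1}^{T} \ell_t(x_t^{\pi_t}, \pi_t) \right] \\
&= \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{x \in \mathcal{X}} \left( \mathbb{I}\{x_t^A = x\} - \mathbb{I}\{x_t^{\pi_t} = x\} \right) \ell_t(x, \pi_t(x)) \right] \\
&= \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{x \in \mathcal{X}} \mathbb{E}\left[ \left( \mathbb{I}\{x_t^A = x\} - \mathbb{I}\{x_t^{\pi_t} = x\} \right) \ell_t(x, \pi_t(x)) \,\middle|\, \mathcal{F}_t \right] \right] \\
&\le \mathbb{E}\left[ \sum_{t=1}^{T} \|\ell_t\|_\infty \|u_t - v_{t,t}\|_1 \right] \\
&\le \mathbb{E}\left[ \sum_{t=1}^{T} \|u_t - v_{t,t}\|_1 \right] , \tag{3}
\end{align*}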
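where $u_s = \mathbb{E}\left[ \mathbb{I}\{x_s^A = x\} \,\middle|\, \mathcal{F}_t \right]$ is the distribution of $x_s^A$ for $s \le t$ and $v_{s,t} = \mathbb{E}\left[ \mathbb{I}\{x_s^{\pi_t} = x\} \,\middle|\, \mathcal{F}_t \right]$ is the distribution of $x_s^{\pi_t}$ for $s \le t$.¹ Let $E_t$ be the event of a policy switch at time $t$. From inequality

\[
\|\pi_{t-k} - \pi_t\|_{\infty,1} \le \|\pi_{t-k} - \pi_{t-k+1}\|_{\infty,1} + \cdots + \|\pi_{t-1} - \pi_t\|_{\infty,1} \le 2 \sum_{s=t-k+1}^{t} \mathbb{I}\{E_s\}
\]

and Lemma 5 we get that

\[
\mathbb{E}\left[ \|\pi_{t-k} - \pi_t\|_{\infty,1} \right] \le 2k \sqrt{\frac{\log|\Pi|}{T}} .
\]

Let $P_t^\pi = P(\pi, m_t)$. We have that

\begin{align*}
\mathbb{E}\left[ \|u_t - v_{t,t}\|_1 \right]
&= \mathbb{E}\left[ \left\| u_{t-1} P_{t-1}^{\pi_{t-1}} - v_{t-1,t} P_{t-1}^{\pi_t} \right\|_1 \right] \\
&\le \mathbb{E}\left[ \left\| u_{t-1} P_{t-1}^{\pi_{t-1}} - u_{t-1} P_{t-1}^{\pi_t} \right\|_1 + \left\| u_{t-1} P_{t-1}^{\pi_t} - v_{t-1,t} P_{t-1}^{\pi_t} \right\|_1 \right] \\
&\le \mathbb{E}\left[ \|\pi_{t-1} - \pi_t\|_{\infty,1} + e^{-1/\tau} \|u_{t-1} - v_{t-1,t}\|_1 \right] \\
&\le \mathbb{E}\left[ \|\pi_{t-1} - \pi_t\|_{\infty,1} + e^{-1/\tau} \left\| u_{t-2} P_{t-2}^{\pi_{t-2}} - u_{t-2} P_{t-2}^{\pi_t} \right\|_1 + e^{-1/\tau} \left\| u_{t-2} P_{t-2}^{\pi_t} - v_{t-2,t} P_{t-2}^{\pi_t} \right\|_1 \right] \\
&\le \mathbb{E}\left[ \|\pi_{t-1} - \pi_t\|_{\infty,1} + e^{-1/\tau} \|\pi_{t-2} - \pi_t\|_{\infty,1} + e^{-2/\tau} \|u_{t-2} - v_{t-2,t}\|_1 \right] \tag{4} \\
&\le \cdots \\
&\le \sum_{k=0}^{t} e^{-k/\tau} \, \mathbb{E}\left[ \|\pi_{t-k} - \pi_t\|_{\infty,1} \right] + e^{-t/\tau} \|u_0 - v_{0,t}\|_1 \\
&\le 2 \sqrt{\frac{\log|\Pi|}{T}} \sum_{k=0}^{t} k \, e^{-k/\tau} + 0 \\
&\le 2\tau^2 \sqrt{\frac{\log|\Pi|}{T}} , \tag{5}
\end{align*}

where we have used the fact that $\|u_0 - v_{0,t}\|_1 = 0$, because the initial distributions are identical. By (5) and (3), we get that

\[
\mathbb{E}\left[ B_T(A) \right] \le 2\tau^2 \sum_{t=1}^{T} \sqrt{\frac{\log|\Pi|}{T}} = 2\tau^2 \sqrt{T \log|\Pi|} .
\]

□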
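
The last step leading to (5) relies on a bound of the form $\sum_{k \ge 0} k e^{-k/\tau} \le \tau^2$. The paper does not spell out this step, so the following numerical check is our own reading of it rather than the authors' argument; the function names are hypothetical. The closed form used for comparison is the standard identity $\sum_{k \ge 0} k r^k = r/(1-r)^2$ with $r = e^{-1/\tau}$.

```python
import math

def weighted_geometric_sum(tau, terms=5000):
    """Partial sum of k * exp(-k / tau) for k = 0, ..., terms - 1."""
    return sum(k * math.exp(-k / tau) for k in range(terms))

def closed_form(tau):
    """Value of the infinite series: r / (1 - r)^2 with r = exp(-1 / tau)."""
    r = math.exp(-1.0 / tau)
    return r / (1.0 - r) ** 2

# Compare the series with tau^2 for a few values of the mixing constant.
for tau in (0.5, 1.0, 3.0, 10.0):
    print(tau, weighted_geometric_sum(tau), tau ** 2)
```

In each case the series stays below $\tau^2$, which is consistent with the $\tau^2$ factor in Lemma 6.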



What makes the analysis possible is the fact that all policies mix no matter what transition 
model is played by the adversary. 



Proof of Theorem 2. The result follows by combining the regret decomposition (1) with Lemmas 3 and 6.

□

The next corollary extends the result of Theorem 2 to continuous policy spaces.

Corollary 7. Let $\Pi$ be an arbitrary policy space, $\mathcal{N}(\epsilon)$ be the $\epsilon$-covering number of space $(\Pi, \|\cdot\|_{\infty,1})$, and $\mathcal{C}(\epsilon)$ be an $\epsilon$-cover. Assume that we run the SD-MDP algorithm on $\mathcal{C}(\epsilon)$. Then, under the same assumptions as in Theorem 2, for any policy $\pi \in \Pi$,

\[
\mathbb{E}\left[ R_T(\text{SD-MDP}, \pi) \right] \le (4 + 2\tau^2) \sqrt{T \log \mathcal{N}(\epsilon)} + \log \mathcal{N}(\epsilon) + \tau T \epsilon .
\]



¹ Notice that $\mathcal{F}_t$ contains only policies, which are independent of the state variables.



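Proof. Let $L_T(\pi) = \mathbb{E}\left[ \sum_{t=1}^{T} \ell_t(x_t^\pi, \pi) \right]$ be the value of policy $\pi$. Let $u_{\pi,t}(x) = \mathbb{P}(x_t^\pi = x)$. First, we prove that the value function is Lipschitz with Lipschitz constant $\tau T$. The argument is similar to the argument in the proof of Lemma 6. For any $\pi_1$ and $\pi_2$,

\begin{align*}
|L_T(\pi_1) - L_T(\pi_2)|
&= \left| \mathbb{E}\left[ \sum_{t=1}^{T} \ell_t(x_t^{\pi_1}, \pi_1) - \sum_{t=1}^{T} \ell_t(x_t^{\pi_2}, \pi_2) \right] \right| \\
&\le \sum_{t=1}^{T} \|\ell_t\|_\infty \|u_{\pi_1,t} - u_{\pi_2,t}\|_1 \\
&\le \sum_{t=1}^{T} \|u_{\pi_1,t} - u_{\pi_2,t}\|_1 .
\end{align*}

With an argument similar to the one in the proof of Lemma 6, we can show that

\[
\|u_{\pi_1,t} - u_{\pi_2,t}\|_1 \le \tau \|\pi_1 - \pi_2\|_{\infty,1} .
\]

Thus,

\[
|L_T(\pi_1) - L_T(\pi_2)| \le \tau T \|\pi_1 - \pi_2\|_{\infty,1} .
\]

Given this and the fact that for any policy $\pi \in \Pi$, there is a policy $\pi' \in \mathcal{C}(\epsilon)$ such that $\|\pi - \pi'\|_{\infty,1} \le \epsilon$, we get that

\[
\mathbb{E}\left[ R_T(\text{SD-MDP}, \pi) \right] \le (4 + 2\tau^2) \sqrt{T \log \mathcal{N}(\epsilon)} + \log \mathcal{N}(\epsilon) + \tau T \epsilon .
\]

In particular, if $\Pi$ is the space of all policies, $\mathcal{N}(\epsilon) \le (|\mathcal{A}|/\epsilon)^{|\mathcal{A}||\mathcal{X}|}$, so the regret is no more than

\[
\mathbb{E}\left[ R_T(\text{SD-MDP}, \pi) \right] \le (4 + 2\tau^2) \sqrt{T |\mathcal{A}| |\mathcal{X}| \log \frac{|\mathcal{A}|}{\epsilon}} + |\mathcal{A}| |\mathcal{X}| \log \frac{|\mathcal{A}|}{\epsilon} + \tau T \epsilon .
\]

By the choice of $\epsilon = 1/T$, we get that $\mathbb{E}\left[ R_T(\text{SD-MDP}, \pi) \right] = O\left( \tau^2 \sqrt{T |\mathcal{A}| |\mathcal{X}| \log(|\mathcal{A}| T)} \right)$.

□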